This report aims to investigate factors that affected the implementation of the global COVID-19 vaccination rollout. Specifically, this report investigates the time lag between the 1st and 2nd vaccine dose inoculation to quantitatively determine the delay in the vaccinated population achieving full immunity. It is found that the majority of country time lags are between 1-2 months, showing that the multiple dose nature of the vaccine rollout causes a considerable delay in achieving full immunity.
In addition, the influence of various demographic, socio-economic and health factors on vaccination speed and coverage is algorithmically assessed. The log of the cardiovascular disease death rate, the GHS Index and life satisfaction are found to have the greatest influence on the vaccination rollout index (VRI), an index that conveys both the speed and coverage of the vaccination rollout.
These results can be utilised by epidemiologists or research students to inform their research into how to achieve greater success and efficiency with future vaccine rollouts. The analysis in this report has additionally been represented in an interactive Shiny application.
The following schematic diagram (Figure 1.1) shows the steps undertaken to investigate the two main questions in this study:
Figure 1.1: The flow chart of the project working process
Since the 8th of December 2020, COVID-19 vaccines have been administered to the public. There is strong evidence of the success of COVID-19 vaccines in reducing fatalities, severe infections, and hospitalisations (Deb et al., 2021). Given the effectiveness of vaccines, it is appropriate to evaluate potential factors affecting the speed and coverage of the vaccine rollout. If certain demographic, socio-economic or health factors can be shown to influence the vaccine rollout, this can help direct efforts to improve vaccine uptake outcomes.
The speed at which the vaccinated population attained full immunity was affected by the multiple dose design of the vaccine rollout: one dose provides partial immunity and, after a length of time, an additional dose is required to achieve full immunity. The delay between doses is therefore a key determinant of how quickly full immunity can be provided to a population, and hence of the effectiveness of a multiple dose rollout.
The COVID-19 time-series dataset provided by the ‘Our World in Data’ online publication (Ritchie & Roser, 2020) was the primary dataset analysed. This dataset covers 207 countries, with measured attributes such as COVID-19 cases, deaths, tests and vaccinations. A COVID-19 vaccination policy dataset from the same source (Ritchie & Roser, 2020) was also explored in this report; it provides time-series data on COVID-19 vaccination availability in 207 countries, represented as a categorical index.
The following datasets were used to provide predictive factors for the vaccine rollout:
The Global Health Security (GHS) Index data were sourced from the GHS Index repository; the index measures the capacity of 195 countries to prepare for pandemics over 2021 to 2022 (“GHS Index Report and Data,” 2022).
‘Our World in Data: Happiness and Life Satisfaction’ contains variables providing measures of happiness and life satisfaction for 160 countries (Ortiz-Ospina & Roser, 2013).
The Corruption Perceptions Index dataset contains data on perceptions of public sector corruption. The index ranges from 0 to 100, where 0 corresponds to maximal corruption (“2020 Corruption Perceptions Index - Explore the Results,” 2021).
To calculate the time lag between a vaccinated population receiving the 1st and 2nd vaccine dose, the difference in days between the people vaccinated (at least one vaccine dose) and people fully vaccinated (two vaccine doses) time-series attributes was computed. This was achieved by repeatedly shifting one time-series curve until the two curves, which both follow logistic functions, roughly overlap; at that point the distance between the two time-series attributes is minimized. Euclidean distance was used as the distance measure in this analysis. The number of one-day shifts required to achieve minimal distance is the difference in days between the two time-series attributes.
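The curve-shifting procedure can be sketched as follows. The project's analysis was done in R, so this is only an illustrative Python sketch; the `max_shift` cap of 99 days is an assumption based on the lag values reported later.

```python
import numpy as np

def dose_time_lag(first_dose, fully_vaccinated, max_shift=99):
    """Estimate the lag in days between the cumulative 'people vaccinated'
    and 'people fully vaccinated' curves: shift the second-dose curve back
    one day at a time and keep the shift that minimises the root-mean-square
    Euclidean distance over the overlapping window."""
    n = len(first_dose)
    best_shift, best_dist = 0, np.inf
    for shift in range(max_shift + 1):
        a = first_dose[: n - shift]       # earlier part of the 1st-dose curve
        b = fully_vaccinated[shift:]      # 2nd-dose curve moved back by `shift`
        dist = np.linalg.norm(a - b) / np.sqrt(len(a))
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift
```

Normalising by the overlap length prevents shorter overlaps from trivially yielding smaller raw distances.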
The time lag was calculated for each country over the period 1 February 2021 to 1 August 2021, excluding a subset of countries with populations below 1,500,000 and inadequate vaccine uptake data.
The vaccine availability policy data for each country contains an index representing which groups of people can access vaccines, with larger index values representing greater access. For each stage of availability, the speed of vaccination was derived as the slope of the number of people vaccinated over time in that period. As evident in the bar chart (Figure 3.1) below, stages 3 and 4 have the highest speed and stage 5 has a lower speed. Despite stage 5 allowing greater access, this decrease in vaccination speed can be attributed to a significant proportion of the population being already vaccinated in stages 3 and 4. Greater availability therefore does not necessarily correlate with higher vaccination speed; rather, other factors affect vaccination uptake.
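The per-stage speed derivation can be sketched as below. The report's computations were done in R; this Python version, with hypothetical `days`, `people_vaccinated` and `stage` arrays, fits a least-squares slope within each policy period.

```python
import numpy as np

def stage_speeds(days, people_vaccinated, stage):
    """Vaccination speed per availability stage: the least-squares slope of
    cumulative people vaccinated against time within each policy period."""
    speeds = {}
    for s in np.unique(stage):
        mask = stage == s
        if mask.sum() < 2:
            continue  # a slope needs at least two observations
        speeds[s] = np.polyfit(days[mask], people_vaccinated[mask], 1)[0]
    return speeds
```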
Figure 3.1: The vaccination speed for different policy periods for selected countries
An analysis of how factors affect vaccine uptake requires a measure of vaccine uptake. Inspired by the research of Kazemi et al. (Kazemi, Bragazzi, & Kong, 2022), we used a vaccination roll-out index (VRI) to represent both the vaccination uptake rate (speed) and the coverage in each country. It is defined as
\[VRI = r \cdot \frac{d}{N}\]
where \(r\) is the vaccination uptake rate, \(d\) is the total number of people vaccinated, and \(N\) is the total population. The ratio \(d/N\) measures the coverage of vaccination while \(r\) measures the speed. Two methods were undertaken to estimate \(r\).
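A minimal worked example of the VRI formula (the numbers below are purely illustrative, not taken from the dataset):

```python
def vri(r, d, N):
    """Vaccination roll-out index: uptake rate times coverage."""
    return r * d / N

# Hypothetical country: uptake rate 0.05/day, 15M of 25M people vaccinated.
print(vri(0.05, 15e6, 25e6))  # ~0.03
```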
It is observed that in most countries the number of people vaccinated roughly follows a logistic growth model, consistent with the observations of the aforementioned report (Kazemi, Bragazzi, & Kong, 2022). Hence, we fitted a logistic growth curve to the people vaccinated attribute:
\[c(t) = \frac{K}{1+a{e}^{-rt}}\]
where \(c(t)\) is the number of people vaccinated as a function of time \(t\),
\(K\) is the asymptotic number of people vaccinated (the level approached by the end of the recorded period), and
\(r\) is the vaccination uptake rate.
The rate \(r\) is estimated as the inverse of the `scal` parameter of the non-linear least squares model fitted in R.
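A sketch of this fit (the report used R's `nls` with the `SSlogis` self-starting model, where \(r = 1/\texttt{scal}\); here `scipy.optimize.curve_fit` is substituted, and the starting values are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, a, r):
    # c(t) = K / (1 + a * exp(-r t)); r corresponds to 1/scal in R's SSlogis
    return K / (1 + a * np.exp(-r * t))

def fit_uptake_rate(days, people_vaccinated):
    """Estimate the uptake rate r by non-linear least squares."""
    p0 = [people_vaccinated.max(), 10.0, 0.05]  # rough starting values
    (K, a, r), _ = curve_fit(logistic, days, people_vaccinated,
                             p0=p0, maxfev=10000)
    return r
```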
A second approach to estimating the vaccination uptake rate models the log of the people vaccinated attribute as an asymptotic regression, in which the rate of change is a function of the log of people vaccinated:
\[\log c(t) = K + \left(\log c(0) - K\right) e^{-rt}\]
\[\frac{d}{dt} \log c(t) = r\left(K - \log c(t)\right)\]
Here \(c(t)\) and \(c(0)\) are the same as defined above,
\(K\) is the asymptote on the log scale, and
\(r\) is the vaccine uptake rate.
The rate \(r\) is estimated as the exponential of the `lrc` parameter of the non-linear least squares model fitted in R.
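The asymptotic fit can be sketched similarly (again substituting `scipy.optimize.curve_fit` for R's `nls` with the `SSasymp` self-starting model; the starting values are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def asymp_log(t, K, c0_log, lrc):
    # R's SSasymp parameterisation: the rate is r = exp(lrc)
    return K + (c0_log - K) * np.exp(-np.exp(lrc) * t)

def fit_uptake_rate_asymp(days, people_vaccinated):
    """Fit the asymptotic model to log(people vaccinated); return r = exp(lrc)."""
    y = np.log(people_vaccinated)
    p0 = [y.max(), y.min(), np.log(0.05)]  # rough starting values
    (K, c0_log, lrc), _ = curve_fit(asymp_log, days, y, p0=p0, maxfev=10000)
    return np.exp(lrc)
```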
The residual standard error (RSE) was used to assess the fit of the models to the people vaccinated attribute. The mean RSE over all locations is \(5.02\times 10^{5}\) for the logistic model, which is lower than the \(1.06\times 10^{6}\) for the asymptotic regression (Figure A.1). Therefore, the logistic model was the more appropriate method for estimating the vaccination uptake parameter.
Multiple socio-economic, demographic and health-related factors have been considered as potential predictors of a country’s VRI (Ataguba, Ojo, & Ichoku, 2016; Maleva, Kartseva, & Korzhuk, 2021). The full list of predictors used is documented in the Appendix (Table A.1). A log transformation was applied to some attributes to mitigate highly skewed data. Because the variables are on different scales, min-max normalization was used to restrict all variables to within 0 and 1.
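A minimal sketch of this preprocessing, assuming a plain numeric matrix (the report's pipeline was in R; `log1p` is used here so zero-valued entries do not break the log transform):

```python
import numpy as np

def preprocess(X, log_cols):
    """Log-transform skewed columns, then min-max scale each column to [0, 1]."""
    X = np.asarray(X, dtype=float).copy()
    X[:, log_cols] = np.log1p(X[:, log_cols])   # log(1 + x) tolerates zeros
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```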
For each predictor, we investigated its relationship with VRI using a scatter plot as well as a Spearman correlation heatmap. All plots are located in the Appendix (Figure A.2).
Random forest (RF) regression is a non-parametric algorithm that can handle feature selection problems with numerous highly correlated variables (Dewi, 2019). The bagging and ensemble techniques used in RF also reduce the likelihood of the model being adversely affected by outliers. Further, the calculation of variable importance in RF captures not only the individual impact of each predictor variable but also its interactions with other predictor variables (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). Hence, RF is an appropriate algorithm for modeling VRI with the predictor attributes.

The association between predictor attributes and VRI was measured by calculating their conditional permutation importance (CPI). The original permutation importance is the mean decrease in error of an RF before and after permuting a predictor's values, reflecting its contribution to modeling the response variable. CPI refines this approach: a predictor is permuted conditionally on the variables with which it is significantly related, according to a user-defined threshold in [0, 1], thereby reducing the bias that correlated covariates introduce when computing variable importance (Debeer & Strobl, 2020).
A hyper-parameter grid search was run, with 5-fold cross-validation, to find the optimal hyper-parameters for the RF (number of trees and number of features included in the split), based on mean squared error and \(R^2\) values.
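The grid search can be sketched with scikit-learn as below; the grid values and scoring choice are illustrative assumptions (the report's actual search, run in R, may have used a different grid):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def tune_rf(X, y):
    """5-fold CV grid search over the number of trees and the number of
    features considered at each split, scored by mean squared error."""
    grid = {
        "n_estimators": [100, 300],  # number of trees (assumed candidate values)
        "max_features": [1, 2],      # features considered at each split
    }
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        grid, cv=5, scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    return search.best_params_, -search.best_score_
```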
The median time lag across the 112 countries was 40.5 days. Cambodia had the smallest time lag, 18 days. The maximum value, 99 days or more, occurred for multiple countries: Algeria, Congo, Libya, Sri Lanka and Malawi. The interquartile range, a measure of the spread of time lag values, was 31.5 days. The time lag values for individual countries can be seen in Table 4.1 below.
| LagType | Lag: Euclidean distance | LagType | Lag: Manhattan distance | LagType | Lag: Minkowski distance (p=3) |
|---|---|---|---|---|---|
| Second Dose Lag | 21 | Second Dose Lag | 15 | Second Dose Lag | 13 |
| Second Dose Lag | 99 | Second Dose Lag | 99 | Second Dose Lag | 99 |
| Second Dose Lag | 52 | Second Dose Lag | 52 | Second Dose Lag | 52 |
| Second Dose Lag | 85 | Second Dose Lag | 82 | Second Dose Lag | 82 |
| Second Dose Lag | 47 | Second Dose Lag | 47 | Second Dose Lag | 46 |
| Second Dose Lag | 64 | Second Dose Lag | 64 | Second Dose Lag | 64 |
| Second Dose Lag | 44 | Second Dose Lag | 43 | Second Dose Lag | 43 |
| Second Dose Lag | 34 | Second Dose Lag | 34 | Second Dose Lag | 34 |
| Second Dose Lag | 25 | Second Dose Lag | 28 | Second Dose Lag | 29 |
| Second Dose Lag | 30 | Second Dose Lag | 35 | Second Dose Lag | 37 |
| Second Dose Lag | 43 | Second Dose Lag | 41 | Second Dose Lag | 40 |
| Second Dose Lag | 54 | Second Dose Lag | 54 | Second Dose Lag | 54 |
| Second Dose Lag | 43 | Second Dose Lag | 42 | Second Dose Lag | 42 |
| Second Dose Lag | 69 | Second Dose Lag | 61 | Second Dose Lag | 60 |
| Second Dose Lag | 40 | Second Dose Lag | 41 | Second Dose Lag | 41 |
| Second Dose Lag | 18 | Second Dose Lag | 19 | Second Dose Lag | 20 |
| Second Dose Lag | 65 | Second Dose Lag | 64 | Second Dose Lag | 64 |
| Second Dose Lag | 93 | Second Dose Lag | 93 | Second Dose Lag | 93 |
| Second Dose Lag | 27 | Second Dose Lag | 30 | Second Dose Lag | 30 |
| Second Dose Lag | 30 | Second Dose Lag | 29 | Second Dose Lag | 29 |
| Second Dose Lag | 64 | Second Dose Lag | 64 | Second Dose Lag | 64 |
| Second Dose Lag | 30 | Second Dose Lag | 30 | Second Dose Lag | 30 |
| Second Dose Lag | 99 | Second Dose Lag | 98 | Second Dose Lag | 98 |
| Second Dose Lag | 36 | Second Dose Lag | 46 | Second Dose Lag | 48 |
| Second Dose Lag | 43 | Second Dose Lag | 44 | Second Dose Lag | 45 |
| Second Dose Lag | 43 | Second Dose Lag | 42 | Second Dose Lag | 42 |
| Second Dose Lag | 35 | Second Dose Lag | 35 | Second Dose Lag | 35 |
| Second Dose Lag | 37 | Second Dose Lag | 37 | Second Dose Lag | 37 |
| Second Dose Lag | 51 | Second Dose Lag | 50 | Second Dose Lag | 50 |
| Second Dose Lag | 59 | Second Dose Lag | 62 | Second Dose Lag | 63 |
| Second Dose Lag | 34 | Second Dose Lag | 34 | Second Dose Lag | 34 |
| Second Dose Lag | 84 | Second Dose Lag | 84 | Second Dose Lag | 84 |
| Second Dose Lag | 44 | Second Dose Lag | 45 | Second Dose Lag | 45 |
| Second Dose Lag | 87 | Second Dose Lag | 88 | Second Dose Lag | 88 |
| Second Dose Lag | 37 | Second Dose Lag | 37 | Second Dose Lag | 37 |
| Second Dose Lag | 45 | Second Dose Lag | 44 | Second Dose Lag | 43 |
| Second Dose Lag | 29 | Second Dose Lag | 29 | Second Dose Lag | 29 |
| Second Dose Lag | 68 | Second Dose Lag | 68 | Second Dose Lag | 68 |
| Second Dose Lag | 82 | Second Dose Lag | 82 | Second Dose Lag | 82 |
| Second Dose Lag | 23 | Second Dose Lag | 23 | Second Dose Lag | 23 |
| Second Dose Lag | 64 | Second Dose Lag | 64 | Second Dose Lag | 64 |
| Second Dose Lag | 26 | Second Dose Lag | 25 | Second Dose Lag | 25 |
| Second Dose Lag | 36 | Second Dose Lag | 35 | Second Dose Lag | 35 |
| Second Dose Lag | 98 | Second Dose Lag | 94 | Second Dose Lag | 92 |
| Second Dose Lag | 48 | Second Dose Lag | 42 | Second Dose Lag | 42 |
| Second Dose Lag | 53 | Second Dose Lag | 56 | Second Dose Lag | 57 |
| Second Dose Lag | 61 | Second Dose Lag | 56 | Second Dose Lag | 49 |
| Second Dose Lag | 40 | Second Dose Lag | 38 | Second Dose Lag | 37 |
| Second Dose Lag | 25 | Second Dose Lag | 26 | Second Dose Lag | 27 |
| Second Dose Lag | 42 | Second Dose Lag | 42 | Second Dose Lag | 42 |
| Second Dose Lag | 84 | Second Dose Lag | 84 | Second Dose Lag | 84 |
| Second Dose Lag | 23 | Second Dose Lag | 23 | Second Dose Lag | 23 |
| Second Dose Lag | 36 | Second Dose Lag | 36 | Second Dose Lag | 36 |
| Second Dose Lag | 28 | Second Dose Lag | 30 | Second Dose Lag | 31 |
| Second Dose Lag | 46 | Second Dose Lag | 42 | Second Dose Lag | 41 |
| Second Dose Lag | 30 | Second Dose Lag | 24 | Second Dose Lag | 23 |
| Second Dose Lag | 27 | Second Dose Lag | 29 | Second Dose Lag | 29 |
| Second Dose Lag | 37 | Second Dose Lag | 35 | Second Dose Lag | 34 |
| Second Dose Lag | 77 | Second Dose Lag | 77 | Second Dose Lag | 77 |
| Second Dose Lag | 99 | Second Dose Lag | 99 | Second Dose Lag | 99 |
| Second Dose Lag | 35 | Second Dose Lag | 35 | Second Dose Lag | 34 |
| Second Dose Lag | 87 | Second Dose Lag | 87 | Second Dose Lag | 87 |
| Second Dose Lag | 99 | Second Dose Lag | 99 | Second Dose Lag | 99 |
| Second Dose Lag | 26 | Second Dose Lag | 26 | Second Dose Lag | 26 |
| Second Dose Lag | 60 | Second Dose Lag | 60 | Second Dose Lag | 60 |
| Second Dose Lag | 64 | Second Dose Lag | 67 | Second Dose Lag | 67 |
| Second Dose Lag | 34 | Second Dose Lag | 38 | Second Dose Lag | 39 |
| Second Dose Lag | 39 | Second Dose Lag | 38 | Second Dose Lag | 37 |
| Second Dose Lag | 30 | Second Dose Lag | 28 | Second Dose Lag | 27 |
| Second Dose Lag | 25 | Second Dose Lag | 25 | Second Dose Lag | 25 |
| Second Dose Lag | 61 | Second Dose Lag | 61 | Second Dose Lag | 62 |
| Second Dose Lag | 39 | Second Dose Lag | 39 | Second Dose Lag | 40 |
| Second Dose Lag | 41 | Second Dose Lag | 41 | Second Dose Lag | 41 |
| Second Dose Lag | 27 | Second Dose Lag | 27 | Second Dose Lag | 27 |
| Second Dose Lag | 26 | Second Dose Lag | 27 | Second Dose Lag | 27 |
| Second Dose Lag | 47 | Second Dose Lag | 48 | Second Dose Lag | 48 |
| Second Dose Lag | 69 | Second Dose Lag | 55 | Second Dose Lag | 55 |
| Second Dose Lag | 58 | Second Dose Lag | 56 | Second Dose Lag | 56 |
| Second Dose Lag | 41 | Second Dose Lag | 42 | Second Dose Lag | 43 |
| Second Dose Lag | 78 | Second Dose Lag | 79 | Second Dose Lag | 79 |
| Second Dose Lag | 53 | Second Dose Lag | 55 | Second Dose Lag | 56 |
| Second Dose Lag | 25 | Second Dose Lag | 25 | Second Dose Lag | 25 |
| Second Dose Lag | 38 | Second Dose Lag | 37 | Second Dose Lag | 37 |
| Second Dose Lag | 41 | Second Dose Lag | 36 | Second Dose Lag | 35 |
| First Dose Lag | 0 | First Dose Lag | 0 | First Dose Lag | 0 |
| Second Dose Lag | 23 | Second Dose Lag | 13 | Second Dose Lag | 11 |
| Second Dose Lag | 25 | Second Dose Lag | 25 | Second Dose Lag | 25 |
| Second Dose Lag | 27 | Second Dose Lag | 26 | Second Dose Lag | 25 |
| Second Dose Lag | 31 | Second Dose Lag | 31 | Second Dose Lag | 31 |
| Second Dose Lag | 29 | Second Dose Lag | 29 | Second Dose Lag | 28 |
| Second Dose Lag | 28 | Second Dose Lag | 32 | Second Dose Lag | 32 |
| Second Dose Lag | 39 | Second Dose Lag | 39 | Second Dose Lag | 39 |
| Second Dose Lag | 32 | Second Dose Lag | 33 | Second Dose Lag | 33 |
| Second Dose Lag | 36 | Second Dose Lag | 37 | Second Dose Lag | 37 |
| Second Dose Lag | 48 | Second Dose Lag | 51 | Second Dose Lag | 52 |
| Second Dose Lag | 35 | Second Dose Lag | 33 | Second Dose Lag | 33 |
| Second Dose Lag | 99 | Second Dose Lag | 99 | Second Dose Lag | 64 |
| Second Dose Lag | 49 | Second Dose Lag | 49 | Second Dose Lag | 48 |
| Second Dose Lag | 29 | Second Dose Lag | 29 | Second Dose Lag | 29 |
| Second Dose Lag | 83 | Second Dose Lag | 83 | Second Dose Lag | 83 |
| Second Dose Lag | 37 | Second Dose Lag | 41 | Second Dose Lag | 43 |
| Second Dose Lag | 89 | Second Dose Lag | 89 | Second Dose Lag | 89 |
| Second Dose Lag | 39 | Second Dose Lag | 41 | Second Dose Lag | 42 |
| Second Dose Lag | 38 | Second Dose Lag | 38 | Second Dose Lag | 38 |
| Second Dose Lag | 63 | Second Dose Lag | 50 | Second Dose Lag | 47 |
| Second Dose Lag | 58 | Second Dose Lag | 55 | Second Dose Lag | 5 |
| Second Dose Lag | 76 | Second Dose Lag | 75 | Second Dose Lag | 75 |
| Second Dose Lag | 28 | Second Dose Lag | 29 | Second Dose Lag | 30 |
| Second Dose Lag | 35 | Second Dose Lag | 33 | Second Dose Lag | 33 |
| Second Dose Lag | 41 | Second Dose Lag | 44 | Second Dose Lag | 45 |
| Second Dose Lag | 33 | Second Dose Lag | 32 | Second Dose Lag | 32 |
| Second Dose Lag | 87 | Second Dose Lag | 88 | Second Dose Lag | 88 |
| Second Dose Lag | 37 | Second Dose Lag | 40 | Second Dose Lag | 40 |
In evaluating the stability of the time lag derivation model, we found it to be stable across distance metrics: time lags computed with the Manhattan distance and the Minkowski distance (p = 3) were similar to the Euclidean distance time lags (Figure 4.1).
Figure 4.1: Boxplot results of time lags for each distance metric.
Despite the different ways of calculating distance, the model recovers roughly the same shift between the two time-series attributes.
A summary of VRIs is given in Table A.2 in the Appendix. The 7 right-side outliers (A.3) were: Nauru (0.152543), Saint Helena (0.143903), Gibraltar (0.052261), Nicaragua (0.043231), San Marino (0.040336), Mongolia (0.039360) and Seychelles (0.039159). The countries with the lowest VRIs were Burundi (0.000039), Yemen (0.000180), and Haiti (0.000411).
The random forest regressor with optimal hyper-parameters used 300 trees with 2 variables considered at each split. The MSE and \(R^2\) were approximately \(9.38 \times 10^{-6}\) and \(0.8852\) respectively. The top 5 important predictors were:
Figure 4.2: The top 5 important variables
Evaluation of model accuracy involved assessing the mean squared error (MSE) and the coefficient of determination \(R^2\). The MSE measures the closeness between predicted and observed values, while \(R^2\) represents the proportion of the variance in VRI explained by the predictors. The lower the MSE and the higher the \(R^2\), the more accurate the model.
As VRI is highly related to time, the chosen time period can affect results (“input affects output”). The period between the commencement of vaccinations and the last recorded day varies from 1 to 1.5 years in the COVID-19 dataset. Therefore, the dataset was separated into 0-4 months, 4-8 months, and 8 months to the latest date, each starting on the first day a country had vaccination data. For each selected period, accuracy was compared using the MSE and \(R^2\).
The 4 to 8 month dataset has the highest \(R^2\) (0.83) and the lowest MSE (0.000013). For inputs from 0 to 4 months and from 8 months to the latest day, \(R^2\) is similar, approximately 0.78. However, the 8 months to latest data has the highest MSE (0.000031), while the 0 to 4 month MSE is 0.000014. Accordingly, the time period of the input data does affect model output, which is most accurate on data between 4 and 8 months (Appendix: Table A.3).
The results of our project and some of the analysis processes are illustrated in the Shiny app deployed on shinyapps.io. Two heatmap layers are plotted on an interactive world map to visualise the comparison of results between locations: the “VRI Overlay” shows the VRI values (scaled into 0-1 using a logistic function) and the “Time Lag Overlay” shows the time lags (in days) for each location. Hover-over labels display the specific values.

The panel on the right shows the scaled VRI values for all locations, and the table can be sorted to show the locations with the highest or lowest scaled VRI. The bottom panel shows location-specific plots and the time lag table for all locations. Users can either click a country on the map or search and select a country from the dropdown menu to see the location-specific information (default: Australia). The time lag graph (left) shows the trends of people vaccinated and people fully vaccinated, so that the user can see whether the time lag between the two doses changes over time and can compare our calculated time lag with the graph. The vaccination trend graph (right) allows the user to see how well our logistic growth model fits the people vaccinated time series, and hence how reliable the VRI is.
The results for time lags between the 1st and 2nd vaccine doses show significant variation between countries. This can be mainly attributed to the brand of vaccine with which the majority of a country’s population was inoculated. Different brands had different recommended waiting times between doses; for example, the AstraZeneca vaccine had a longer waiting period than the Pfizer and Moderna vaccines (12 weeks vs. 3 and 4 weeks respectively). Accordingly, the United Kingdom, whose population was mostly inoculated with the AstraZeneca vaccine, has a time lag of 76 days, in contrast with the 28-day time lag of the United States, whose population was mostly inoculated with the Pfizer and Moderna vaccines.
76 days is a substantial amount of time to be without full immunity, especially in contrast with 28 days. Hence, if the United Kingdom, or other countries with long time lags due to the AstraZeneca vaccine, had wanted their vaccinated populations to achieve full immunity as quickly as possible, they should have invested in the Pfizer and Moderna vaccines.
Countries between the minimum value and the first quartile had time lags of 18 to 30.5 days. This is 3-4 weeks and reflects the expected time lag of countries that inoculated their populations mostly with the Pfizer and Moderna vaccines. While these time lags are considerably shorter than those of other countries, they still reveal a considerable period during which vaccinated people are not fully protected from COVID-19. This identifies a flaw in the multiple dose rollout and encourages epidemiologists and researchers to explore methods of reducing the waiting periods of a multiple dose rollout, or of producing single dose vaccines that provide full immunity.
The variable importance results show that the cardiovascular disease death rate has the most influence on the VRI, distinctly more important than the other predictors. This reflects that countries that perform comparatively well in managing health issues like cardiovascular disease are likely to perform well in ensuring their populations are immunised. The Global Health Security (GHS) Index and life satisfaction are the next two most important variables, indicating that countries that are prepared for epidemics, and where the majority of people are satisfied with their lives, will have better vaccine rollout outcomes.
Some countries, e.g. the Democratic Republic of the Congo, have excessive missing data for some variables, making it impossible to calculate time lags or VRIs; thus the analysis cannot cover all countries. This effect worsens as external datasets, which cover fewer countries, are added.
The discovered factors affecting vaccine uptake may not accurately represent causal effects: the results are model-based and will change when the hyper-parameters, the user-defined threshold, or even the operating system vary. Further, factors outside the datasets used may also affect vaccine uptake, but no source covers such variables as comprehensively as the datasets used, which further limits the analysis of the research question.
Since the aim was to calculate importance rather than to predict VRI, all of the data was used as the training set, which could lead to overfitting.
Collecting additional datasets containing other factors, such as vaccine type, trust in health institutions, public perception and individual lifestyle, would allow their effect on vaccine uptake to be tested. Exploring more factors would give a more holistic analysis of vaccination uptake.
The conditional permutation importance is sensitive to the choice of hyper-parameters and the user-defined threshold. Though setting sensible values has proved difficult, one future improvement is to evaluate how changes to these pre-set values affect the final result.
The analysis of the influence of demographic, socio-economic and health factors on vaccination speed and coverage found the following factors most influential: the log of the cardiovascular disease death rate, the GHS Index and life satisfaction. This suggests that positive outcomes in these factors will correlate with positive vaccination outcomes.
The constructed Shiny application allows users to compare time lags between countries, trends in vaccination speed and availability of vaccines and the accuracy of the measures of vaccine uptake.
Future studies can build on this report by integrating new and more diverse data to explore more relationships between vaccination outcomes and various health, demographic and socio-economic factors.
Everyone researched and came up with ideas during exploration, some of which were recorded in our GitHub repository. David mainly developed the idea for the sub-questions while Sylvia researched and proposed ideas for the VRI.
The time lag model was mainly coded and formulated by David, including visualising and evaluating the results. Aurora and Tina collaborated on the data exploration of VRI based on vaccination availability, while Sylvia built the random forest model and calculated the variable importance, with data cleaning help from Vighnesh. Diana further refined the \(r\) estimation models, their evaluation, and the VRI calculation. Aurora evaluated the effect of input time selection on the model. Harun created clusters for VRI. Diana wrote all the Shiny app code, and Vighnesh and Sylvia wrote the About page.
For the report, Vighnesh wrote the executive summary and collaborated with Tina on the Background and Data section. David contributed majorly to the time lag and result section. For the VRI, Sylvia wrote the methods, results, and discussion while Aurora wrote about exploration and evaluation. Diana wrote the goodness-of-fit evaluation for the asymptotic model and the Shiny app section. Harun wrote the limitations and future improvements along with proofreading the report.
Figure A.1: RSE comparison of logistic vs asymptotic model.
| Selected Variable Name |
|---|
| aged_65_older |
| gdp_per_capita |
| cardiovasc_death_rate |
| population_density |
| hospital_beds_per_thousand |
| human_development_index |
| extreme_poverty |
| diabetes_prevalence |
| life_expectancy |
| vri8 |
| vrif |
| CPI score 2020 |
Figure A.2: Interactive plots of Spearman’s correlation heatmap and scatter plots between variables and VRI
| | location | vri |
|---|---|---|
| 1 | Afghanistan | 0.0024 |
| 2 | Albania | 0.0061 |
| 3 | Algeria | 0.0036 |
| 4 | Andorra | 0.0202 |
| 5 | Angola | 0.0069 |
| 6 | Anguilla | 0.0180 |
| 7 | Antigua and Barbuda | 0.0044 |
| 8 | Argentina | 0.0175 |
| 9 | Armenia | 0.0101 |
| 10 | Aruba | 0.0134 |
| 11 | Australia | 0.0189 |
| 12 | Austria | 0.0176 |
| 13 | Azerbaijan | 0.0118 |
| 14 | Bahamas | 0.0071 |
| 15 | Bahrain | 0.0182 |
| 16 | Bangladesh | 0.0135 |
| 17 | Barbados | 0.0062 |
| 18 | Belarus | 0.0102 |
| 19 | Belgium | 0.0246 |
| 20 | Belize | 0.0121 |
| 21 | Benin | 0.0089 |
| 22 | Bermuda | 0.0211 |
| 23 | Bhutan | 0.0036 |
| 24 | Bolivia | 0.0086 |
| 25 | Bonaire Sint Eustatius and Saba | NA |
| 26 | Bosnia and Herzegovina | 0.0064 |
| 27 | Botswana | 0.0104 |
| 28 | Brazil | 0.0186 |
| 29 | British Virgin Islands | 0.0109 |
| 30 | Brunei | 0.0264 |
| 31 | Bulgaria | 0.0045 |
| 32 | Burkina Faso | 0.0033 |
| 33 | Burundi | 0.0000 |
| 34 | Cambodia | 0.0254 |
| 35 | Cameroon | 0.0006 |
| 36 | Canada | 0.0294 |
| 37 | Cape Verde | 0.0197 |
| 38 | Cayman Islands | 0.0188 |
| 39 | Central African Republic | 0.0026 |
| 40 | Chad | NA |
| 41 | Chile | 0.0169 |
| 42 | China | 0.0180 |
| 43 | Colombia | 0.0149 |
| 44 | Comoros | 0.0068 |
| 45 | Congo | 0.0024 |
| 46 | Cook Islands | 0.0071 |
| 47 | Costa Rica | 0.0190 |
| 48 | Cote d’Ivoire | 0.0038 |
| 49 | Croatia | 0.0096 |
| 50 | Cuba | 0.0272 |
| 51 | Curacao | 0.0223 |
| 52 | Cyprus | 0.0193 |
| 53 | Czechia | 0.0172 |
| 54 | Democratic Republic of Congo | NA |
| 55 | Denmark | 0.0236 |
| 56 | Djibouti | 0.0024 |
| 57 | Dominica | 0.0040 |
| 58 | Dominican Republic | 0.0170 |
| 59 | Ecuador | 0.0232 |
| 60 | Egypt | 0.0096 |
| 61 | El Salvador | 0.0178 |
| 62 | Equatorial Guinea | 0.0036 |
| 63 | Estonia | 0.0131 |
| 64 | Eswatini | 0.0076 |
| 65 | Ethiopia | 0.0026 |
| 66 | Faeroe Islands | 0.0219 |
| 67 | Falkland Islands | 0.0087 |
| 68 | Fiji | 0.0236 |
| 69 | Finland | 0.0209 |
| 70 | France | 0.0202 |
| 71 | French Polynesia | 0.0118 |
| 72 | Gabon | 0.0020 |
| 73 | Gambia | 0.0027 |
| 74 | Georgia | 0.0082 |
| 75 | Germany | 0.0206 |
| 76 | Ghana | 0.0071 |
| 77 | Gibraltar | 0.0523 |
| 78 | Greece | 0.0150 |
| 79 | Greenland | 0.0267 |
| 80 | Grenada | 0.0050 |
| 81 | Guatemala | 0.0107 |
| 82 | Guernsey | 0.0214 |
| 83 | Guinea | 0.0038 |
| 84 | Guinea-Bissau | 0.0088 |
| 85 | Guyana | 0.0087 |
| 86 | Haiti | 0.0004 |
| 87 | Honduras | 0.0133 |
| 88 | Hong Kong | 0.0131 |
| 89 | Hungary | 0.0256 |
| 90 | Iceland | 0.0308 |
| 91 | India | 0.0119 |
| 92 | Indonesia | 0.0124 |
| 93 | Iran | 0.0277 |
| 94 | Iraq | 0.0050 |
| 95 | Ireland | 0.0216 |
| 96 | Isle of Man | 0.0369 |
| 97 | Israel | 0.0229 |
| 98 | Italy | 0.0207 |
| 99 | Jamaica | 0.0050 |
| 100 | Japan | 0.0255 |
| 101 | Jersey | 0.0132 |
| 102 | Jordan | 0.0093 |
| 103 | Kazakhstan | 0.0111 |
| 104 | Kenya | 0.0024 |
| 105 | Kiribati | 0.0191 |
| 106 | Kuwait | 0.0145 |
| 107 | Kyrgyzstan | 0.0032 |
| 108 | Laos | 0.0135 |
| 109 | Latvia | 0.0117 |
| 110 | Lebanon | 0.0056 |
| 111 | Lesotho | 0.0113 |
| 112 | Liberia | 0.0058 |
| 113 | Libya | 0.0057 |
| 114 | Liechtenstein | 0.0185 |
| 115 | Lithuania | 0.0152 |
| 116 | Luxembourg | 0.0199 |
| 117 | Macao | 0.0140 |
| 118 | Madagascar | 0.0007 |
| 119 | Malawi | 0.0009 |
| 120 | Malaysia | 0.0324 |
| 121 | Maldives | 0.0198 |
| 122 | Mali | 0.0011 |
| 123 | Malta | 0.0236 |
| 124 | Mauritania | 0.0074 |
| 125 | Mauritius | 0.0214 |
| 126 | Mexico | 0.0132 |
| 127 | Moldova | 0.0060 |
| 128 | Monaco | 0.0105 |
| 129 | Mongolia | 0.0394 |
| 130 | Montenegro | 0.0089 |
| 131 | Montserrat | 0.0020 |
| 132 | Morocco | 0.0114 |
| 133 | Mozambique | 0.0108 |
| 134 | Myanmar | 0.0078 |
| 135 | Namibia | 0.0033 |
| 136 | Nauru | 0.1525 |
| 137 | Nepal | 0.0106 |
| 138 | Netherlands | 0.0248 |
| 139 | New Caledonia | 0.0119 |
| 140 | New Zealand | 0.0250 |
| 141 | Nicaragua | 0.0432 |
| 142 | Niger | 0.0012 |
| 143 | Nigeria | 0.0013 |
| 144 | Niue | 0.0085 |
| 145 | North Macedonia | 0.0118 |
| 146 | Norway | 0.0201 |
| 147 | Oman | 0.0214 |
| 148 | Pakistan | 0.0088 |
| 149 | Palestine | 0.0077 |
| 150 | Panama | 0.0218 |
| 151 | Papua New Guinea | 0.0007 |
| 152 | Paraguay | 0.0178 |
| 153 | Peru | 0.0167 |
| 154 | Philippines | 0.0124 |
| 155 | Pitcairn | NA |
| 156 | Poland | 0.0157 |
| 157 | Portugal | 0.0234 |
| 158 | Qatar | 0.0175 |
| 159 | Romania | 0.0131 |
| 160 | Russia | 0.0080 |
| 161 | Rwanda | 0.0184 |
| 162 | Saint Helena | 0.1439 |
| 163 | Saint Kitts and Nevis | 0.0127 |
| 164 | Saint Lucia | 0.0034 |
| 165 | Saint Vincent and the Grenadines | 0.0026 |
| 166 | Samoa | 0.0129 |
| 167 | San Marino | 0.0403 |
| 168 | Sao Tome and Principe | 0.0091 |
| 169 | Saudi Arabia | 0.0144 |
| 170 | Senegal | 0.0015 |
| 171 | Serbia | 0.0112 |
| 172 | Seychelles | 0.0392 |
| 173 | Sierra Leone | 0.0039 |
| 174 | Singapore | 0.0248 |
| 175 | Sint Maarten (Dutch part) | 0.0141 |
| 176 | Slovakia | 0.0115 |
| 177 | Slovenia | 0.0126 |
| 178 | Solomon Islands | 0.0065 |
| 179 | Somalia | 0.0022 |
| 180 | South Africa | 0.0081 |
| 181 | South Korea | 0.0233 |
| 182 | South Sudan | 0.0009 |
| 183 | Spain | 0.0226 |
| 184 | Sri Lanka | 0.0295 |
| 185 | Sudan | 0.0019 |
| 186 | Suriname | 0.0112 |
| 187 | Sweden | 0.0189 |
| 188 | Switzerland | 0.0169 |
| 189 | Syria | 0.0027 |
| 190 | Taiwan | 0.0234 |
| 191 | Tajikistan | 0.0095 |
| 192 | Tanzania | NA |
| 193 | Thailand | 0.0198 |
| 194 | Timor | 0.0111 |
| 195 | Togo | 0.0033 |
| 196 | Tokelau | NA |
| 197 | Tonga | 0.0127 |
| 198 | Trinidad and Tobago | 0.0145 |
| 199 | Tunisia | 0.0146 |
| 200 | Turkey | 0.0156 |
| 201 | Turkmenistan | NA |
| 202 | Turks and Caicos Islands | 0.0116 |
| 203 | Tuvalu | 0.0153 |
| 204 | Uganda | 0.0085 |
| 205 | Ukraine | 0.0075 |
| 206 | United Arab Emirates | 0.0137 |
| 207 | United Kingdom | 0.0161 |
| 208 | United States | 0.0140 |
| 209 | Uruguay | 0.0258 |
| 210 | Uzbekistan | 0.0118 |
| 211 | Vanuatu | 0.0107 |
| 212 | Venezuela | 0.0173 |
| 213 | Vietnam | 0.0297 |
| 214 | Wallis and Futuna | 0.0046 |
| 215 | Yemen | 0.0002 |
| 216 | Zambia | 0.0026 |
| 217 | Zimbabwe | 0.0057 |
Figure A.3: Boxplot of VRI values
| Time.period | MSE | R_squared |
|---|---|---|
| 0 to 4 months | 0.0000147 | 0.7615992 |
| 4 to 8 months | 0.0000131 | 0.8260440 |
| 8 months to current | 0.0000315 | 0.7924668 |
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8.1 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] permimp_1.0-2 deSolve_1.32 reshape2_1.4.4
## [4] broom_0.8.0 caret_6.0-92 lattice_0.20-45
## [7] ranger_0.13.1 randomForest_4.7-1 qtlcharts_0.16
## [10] lubridate_1.8.0 viridis_0.6.2 viridisLite_0.4.0
## [13] maps_3.4.0 ggthemes_4.2.4 readxl_1.4.0
## [16] plotly_4.10.0 DT_0.22 kableExtra_1.3.4.9000
## [19] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
## [22] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [25] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] TH.data_1.1-1 colorspace_2.0-3 modeltools_0.2-23
## [4] ellipsis_0.3.2 class_7.3-20 fs_1.5.2
## [7] proxy_0.4-26 rstudioapi_0.13 farver_2.1.0
## [10] listenv_0.8.0 bit64_4.0.5 mvtnorm_1.1-3
## [13] prodlim_2019.11.13 fansi_1.0.3 coin_1.4-2
## [16] xml2_1.3.3 codetools_0.2-18 splines_4.1.3
## [19] libcoin_1.0-9 knitr_1.39 jsonlite_1.8.0
## [22] pROC_1.18.0 dbplyr_2.1.1 compiler_4.1.3
## [25] httr_1.4.2 backports_1.4.1 assertthat_0.2.1
## [28] Matrix_1.4-0 fastmap_1.1.0 lazyeval_0.2.2
## [31] strucchange_1.5-2 cli_3.3.0 htmltools_0.5.2
## [34] tools_4.1.3 qtl_1.50 gtable_0.3.0
## [37] glue_1.6.2 Rcpp_1.0.8.3 cellranger_1.1.0
## [40] jquerylib_0.1.4 vctrs_0.4.1 svglite_2.1.0
## [43] nlme_3.1-155 crosstalk_1.2.0 iterators_1.0.14
## [46] timeDate_3043.102 xfun_0.30 gower_1.0.0
## [49] globals_0.14.0 rvest_1.0.2 lifecycle_1.0.1
## [52] future_1.25.0 zoo_1.8-10 MASS_7.3-55
## [55] scales_1.2.0 ipred_0.9-12 vroom_1.5.7
## [58] hms_1.1.1 sandwich_3.0-1 parallel_4.1.3
## [61] yaml_2.3.5 gridExtra_2.3 sass_0.4.1
## [64] rpart_4.1.16 stringi_1.7.6 highr_0.9
## [67] foreach_1.5.2 hardhat_0.2.0 lava_1.6.10
## [70] matrixStats_0.62.0 rlang_1.0.2 pkgconfig_2.0.3
## [73] systemfonts_1.0.4 evaluate_0.15 labeling_0.4.2
## [76] recipes_0.2.0 htmlwidgets_1.5.4 bit_4.0.4
## [79] tidyselect_1.1.2 parallelly_1.31.1 plyr_1.8.7
## [82] magrittr_2.0.3 bookdown_0.26 R6_2.5.1
## [85] generics_0.1.2 multcomp_1.4-19 DBI_1.1.2
## [88] pillar_1.7.0 haven_2.5.0 withr_2.5.0
## [91] survival_3.2-13 nnet_7.3-17 future.apply_1.9.0
## [94] modelr_0.1.8 crayon_1.5.1 utf8_1.2.2
## [97] party_1.3-10 tzdb_0.3.0 rmarkdown_2.14
## [100] grid_4.1.3 data.table_1.14.2 ModelMetrics_1.2.2.2
## [103] reprex_2.0.1 digest_0.6.29 webshot_0.5.3
## [106] stats4_4.1.3 munsell_0.5.0 bslib_0.3.1
library(tidyverse)
library(kableExtra)
library(DT)
library(plotly)
library(readxl)
library(ggthemes)
library(maps)
library(viridis) # Color gradients
library(lubridate)
library(qtlcharts) # Interactive scatter plots
library(randomForest)
library(ranger) # a faster implementation of randomForest
library(caret) # run many machine learning models through a unified interface
library(broom) # try `install.packages("backports")` if directly installing broom gives an error
library(reshape2) # to use melt()
library(deSolve)
library(permimp) # for calculating conditional importance in a speedy way
covid_data = read_csv("data/covid_data_latest.csv")
covid_data$date = ymd(covid_data$date)
policy = read_csv("data/covid-vaccination-policy.csv")
policy$Day = ymd(policy$Day)
happiness = read_csv("data/happiness-cantril-ladder.csv")
ghs = read_csv("data/2021-GHS-Index-April-2022.csv")
corruption = read_xlsx("data/CPI2020_GlobalTablesTS_210125.xlsx", sheet=1, skip=2)
source("data/ct_model.R")
# smoother function, returns smoothed column
Lowess <- function(data, f) {
lowess_fit <- lowess(data, f = f)
return(lowess_fit$y)
}
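As a quick standalone check of this helper (the sine curve, noise level, and span of f = 0.15 here are illustrative), lowess smoothing preserves the series length while suppressing point-to-point noise:

```r
# Same helper as above, applied to a synthetic noisy series
Lowess <- function(data, f) {
  lowess_fit <- lowess(data, f = f)
  return(lowess_fit$y)
}
set.seed(3888)
noisy <- sin(seq(0, 2 * pi, length.out = 100)) + rnorm(100, sd = 0.2)
smoothed <- Lowess(noisy, f = 0.15)
length(smoothed) == length(noisy)       # length is preserved
var(diff(smoothed)) < var(diff(noisy))  # day-to-day jitter is reduced
```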
lag_covid = covid_data
lag_covid = lag_covid %>%
filter(population > 1500000) %>%
filter(gdp_per_capita > 0)
countrys <- unique(lag_covid$location)
deleted <- c("Afghanistan", "Antigua and Barbuda", "Bangladesh","Benin", "Bhutan", "Bonaire Sint Eustatius and Saba", "Botswana", "Burundi","Burkina Faso", "Cameroon", "Cote d'Ivoire", "Democratic Republic of Congo", "Ethiopia","Eritrea", "Gabon", "Ghana", "Guernsey", "Guinea", "Kenya", "Kuwait", "Liberia", "Laos", "Namibia", "Nepal","Nicaragua", "Niger", "Nigeria", "Palestine", "Philippines", "Pitcairn", "Rwanda", "Saint Helena", "Senegal", "Sierra Leone", "Somalia", "South Sudan", "Sudan", "Tokelau", "Turkmenistan","Tanzania", "Uganda","Yemen", "World", "Zambia")
countrys = countrys[! countrys %in% deleted]
lag_covid = select(lag_covid, "date", "location", "people_vaccinated_per_hundred", "people_fully_vaccinated_per_hundred")
start_date = "2021-02-01"
end_date = "2021-08-01"
lag_covid = lag_covid %>% filter(date >= start_date & date < end_date)
lag_covid$people_vaccinated_per_hundred[is.na(lag_covid$people_vaccinated_per_hundred)] = 0
lag_covid$people_fully_vaccinated_per_hundred[is.na(lag_covid$people_fully_vaccinated_per_hundred)] = 0
lagValue <- function(FirstDose, SecondDose, windowsize, p)
{
# vector for all measures of distance between matrices
dist_vector = c()
i = 1
while (i <= windowsize){
# select different subsets of matrices, calculate the distances between the 2 matrices and store the distance. This while loop will contain information for 1st vaccine lag
FirstDose_subset <- FirstDose[i:nrow(FirstDose),1]
SecondDose_subset <- SecondDose[1:(1 - i + nrow(SecondDose)),1]
dist_FirstDose <- proxy::dist(t(FirstDose_subset), t(SecondDose_subset), method = "Minkowski", p = p)
dist_vector = c(dist_vector, dist_FirstDose)
i = i + 1
}
j = 1
while (j <= windowsize){
# select different subsets of matrices, calculate the distances between the 2 matrices and store the distance. This while loop will contain information for 2nd vaccine lag
FirstDose_subset1 <- FirstDose[1:(1 - j + nrow(FirstDose)),1]
SecondDose_subset1 <- SecondDose[j:nrow(SecondDose),1]
dist_SecondDose <- proxy::dist(t(FirstDose_subset1), t(SecondDose_subset1), method = "Minkowski", p = p)
dist_vector = c(dist_vector, dist_SecondDose)
j = j + 1
}
# select min value index which corresponds to value of the lag
return(which.min(dist_vector))
}
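The shift-matching idea behind lagValue() can be sanity-checked on synthetic data: delay a monotone curve by a known number of days and confirm that the distance-minimising shift recovers that delay. This is a standalone sketch (the 21-day delay, 120-day series, and 50-day search window are arbitrary):

```r
set.seed(3888)
first_dose  <- cumsum(runif(120))                  # synthetic cumulative 1st-dose curve
second_dose <- c(rep(0, 21), head(first_dose, 99)) # same curve delayed by 21 days
# Euclidean (p = 2) distance between the two curves at each candidate shift
dists <- sapply(0:50, function(k) {
  n <- length(first_dose) - k
  sqrt(sum((first_dose[1:n] - second_dose[(k + 1):(k + n)])^2))
})
which.min(dists) - 1  # recovers the 21-day lag
```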
lag_vector_1 <- c()
lag_vector_2 <- c()
lag_vector_3 <- c()
# loop through each country and calculate 3 time lags: Manhattan (p = 1), Euclidean (p = 2) and Minkowski (p = 3) distance time lags
p = 1
while (p <= 3){
z = 1
while (z <= length(countrys)){
# only select records for certain country and only select 1st and 2nd vaccine columns
lagCovid_filtered = filter(lag_covid, location == countrys[z])
combined_matrix <- cbind(lagCovid_filtered[,3], lagCovid_filtered[,4])
# Missing values were filled with 0 above; carry the previous day's value forward to replace them, for both the 1st and 2nd dose columns
for (i in 2:nrow(combined_matrix)){
if (combined_matrix[i,1] == 0){
combined_matrix[i,1] = combined_matrix[i-1, 1]
}
if (combined_matrix[i,2] == 0){
combined_matrix[i,2] = combined_matrix[i-1, 2]
}
}
# Apply smoothing function to 1st and 2nd vaccine columns. f = 0.15 is an arbitrary value
combined_matrix_smooth<- as.matrix(apply(combined_matrix, 2, Lowess, f = 0.15))
# Store each column separately as individual matrices
FirstDose_matrix = as.matrix(combined_matrix_smooth[,1])
SecondDose_matrix = as.matrix(combined_matrix_smooth[,2])
# Input the individual matrices into the lagValue function to find the lag between the 1st and 2nd dose for a particular country
lag <- lagValue(FirstDose_matrix, SecondDose_matrix, windowsize=100, p)
#store value of lag
if (p == 1){
lag_vector_1 <- c(lag_vector_1, lag)
} else if (p == 2){
lag_vector_2 <- c(lag_vector_2, lag)
} else {
lag_vector_3 <- c(lag_vector_3, lag)
}
z = z + 1
}
p = p + 1
}
# label the lag values with the corresponding country
names(lag_vector_1) <- countrys
names(lag_vector_2) <- countrys
names(lag_vector_3) <- countrys
# function to explain the time lag value
lagType <- function(lag, windowsize)
{ # Function to convert the index value returned by lagValue into a time lag in days.
# Any lag values that are greater than windowsize were part of the 2nd half of the 'dist_vector' from the lagValue function, the half of the vector for the 2nd vaccine lag.
# Therefore need to subtract off all the days from the 1st half of the 'dist_vector' to get number of days for 2nd vaccine lag.
# No such issue for 1st vaccine lag as all values are within first half.
if (lag > windowsize){
return(c(LagType = "Second Dose Lag", Lag = lag - windowsize - 1))
} else {
return(c(LagType = "First Dose Lag", Lag = lag - 1))
}
}
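To trace the conversion by hand (the index values 30 and 150 below are illustrative), with windowsize = 100 an index in the first half of dist_vector encodes a first-dose lag and an index in the second half a second-dose lag:

```r
# Standalone copy of the conversion rule used by lagType()
windowsize <- 100
idx_to_lag <- function(idx) {
  if (idx > windowsize) {
    c(LagType = "Second Dose Lag", Lag = idx - windowsize - 1)
  } else {
    c(LagType = "First Dose Lag", Lag = idx - 1)
  }
}
idx_to_lag(30)   # first-dose lag of 30 - 1 = 29 days
idx_to_lag(150)  # second-dose lag of 150 - 100 - 1 = 49 days
```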
# Apply function to each country's Time lag value
lag_df_1 = mapply(lagType, lag = lag_vector_1, windowsize = 100)
lag_df_2 = mapply(lagType, lag = lag_vector_2, windowsize = 100)
lag_df_3 = mapply(lagType, lag = lag_vector_3, windowsize = 100)
# Visualise Time lags
total_lag = cbind(t(lag_df_1), t(lag_df_2), t(lag_df_3))
colnames(total_lag) = c("LagType", "Lag: Manhattan distance", "LagType", "Lag: Euclidean distance", "LagType", "Lag: Minkowski distance (P=3)") # p = 1 is Manhattan, p = 2 is Euclidean
# Country list
countries = c("United States", "India", "Brazil", "France", "United Kingdom", "Russia",
"Turkey", "Italy", "Germany", "Spain", "Argentina",
"Iran", "Colombia", "Poland", 'Mexico', "Netherlands",
"Indonesia", "Ukraine", "South Africa", "Philippines", "Peru", "Belgium",
                "Czechia", "Japan")
# Select only the listed countries
policy_data <- covid_data %>%
filter(location %in% countries)
policy <- policy %>% filter(Entity %in% countries)
policy_data <- policy_data %>%
select(iso_code, continent, location, date,new_people_vaccinated_smoothed_per_hundred, new_people_vaccinated_smoothed, new_vaccinations, new_vaccinations_smoothed, new_vaccinations_smoothed_per_million, population, people_vaccinated)
# Merge the policy data with the COVID data
policy_data <- merge(policy_data, policy, by.x = c('iso_code', "date", "location"), by.y = c('Code', 'Day', "Entity"))
# The 'get_slope' function calculates, for each vaccination-policy stage (1 to 5), the slope of a country's vaccination uptake using lm(). The slope represents the speed of vaccination uptake under that policy stage and is later compared across countries in a graph.
get_slope <- function(country_data, country_policy, country) {
i = 1
country_slope <- data.frame(policy = as.factor(1:5), speed = rep(0, 5), country = rep(country, 5))
while (i <= length(country_policy$vaccination_policy)) {
# Select current policy
current_policy = country_policy$vaccination_policy[i]
if (current_policy > 0 & !is.na(country_policy$people_vaccinated[country_policy$vaccination_policy == current_policy])) {
# Extract the number of people vaccinated
temp_people <- country_data$people_vaccinated[country_data$vaccination_policy == current_policy & !is.na(country_data$people_vaccinated)] / country_data$population[1]
temp_index <- 1:length(temp_people)
if (length(temp_index) > 1) {
# Linear model
mod <- lm(temp_people ~ temp_index)
# Extract and store the slope
country_slope$speed[country_slope$policy == current_policy] <- summary(mod)$coefficients[2,1]
} else {
# Some countries only have a policy for 1 day -> treat the slope as 0
country_slope$speed[country_slope$policy == current_policy] <- 0
}
}
i <- i + 1
}
return(country_slope)
}
country_slope = NULL
#loop through the countries in the dataset and subset the required variables in another dataframe
for (country in countries) {
country_data <- policy_data[policy_data$location == country, ]
country_policy <- country_data[!is.na(country_data$people_vaccinated), ] %>%
group_by(vaccination_policy) %>%
arrange(date) %>%
slice(1L) %>%
select(iso_code, location, date, vaccination_policy, new_vaccinations_smoothed, population, people_vaccinated)
#Apply the 'get_slope' function to calculate the speed of each country then store in another vector
temp_slope <- get_slope(country_data, country_policy, country)
country_slope <- rbind(country_slope, temp_slope)
}
country_slope$policy <- as.factor(country_slope$policy)
# plot the graph
global <- ggplot(data = country_slope, aes(x = country, y = speed, fill = policy)) +
geom_bar(stat="identity", position=position_dodge()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_sqrt() +
ggtitle("All countries") +
xlab("Countries") +
ylab("Slope") +
labs(fill = "Availability stage")
ggplotly(global) %>%
layout(autosize = F, width = '100%', height = '100%')
# Example of use, inspect the r_list to see what's included
r_list1 = ct_model(covid_data, log.y = FALSE, model = "logis")
r_list2 = ct_model(covid_data, log.y = TRUE, model = "asymp")
vri_data <- data.frame(bind_rows(r_list1$vri_data))
# Merge the datasets
corruption <- corruption %>% select(c("ISO3", "CPI score 2020"))
happiness <- happiness %>% group_by(Code) %>%
arrange(Year) %>%
filter(!is.na(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`)) %>%
dplyr::summarise(satisfaction=last(`Life satisfaction in Cantril Ladder (World Happiness Report 2021)`))
joint_data <- merge(vri_data,corruption, by.x=c("iso_code"), by.y=c("ISO3"))
joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))
# GHS index (Global Health Security)
ghs <- ghs %>% filter(Year == 2021) %>% select(Country,`OVERALL SCORE`)
## adding iso code to ghs dataset
ghs <- ghs %>% mutate(iso_code = iso.alpha(Country, 3), .before = 1)
ghs <- ghs %>% mutate(
iso_code = case_when(
!is.na(iso_code) ~ iso_code,
Country == "Bosnia and Hercegovina" ~ "BIH",
Country == "Cabo Verde" ~ "CPV",
Country == "Congo (Brazzaville)" ~ "COG",
Country == "Congo (Democratic Republic)" ~ "COD",
Country == "Côte d'Ivoire" ~ "CIV",
Country == "eSwatini" ~ "SWZ",
Country == "Kyrgyz Republic" ~ "KGZ",
Country == "São Tomé and Príncipe" ~ "STP",
Country == "St Kitts & Nevis" ~ "KNA",
Country == "St Lucia" ~ "LCA",
Country == "St Vincent & The Grenadines" ~ "VCT",
Country == "United Kingdom" ~ "GBR",
Country == "United States of America" ~ "USA"
)
)
joint_data <- merge(joint_data, ghs[,-2], by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[16] <- "GHS_score"
# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
mutate(
across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
~log(.))
) %>%
rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
.fn = ~paste0("log_", .))
min_max_norm <- function(x) {
(x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))
}
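For reference, min-max scaling maps each variable onto [0, 1] while leaving NAs untouched (a standalone sketch reusing the same helper):

```r
min_max_norm <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
min_max_norm(c(2, 4, 6, NA))  # 0.0 0.5 1.0 NA
```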
scaled_data <- data.frame(lapply(vri_extra_logged[4:15], min_max_norm), vri = vri_extra_logged$vri)
scaled_data_cleaned <- scaled_data %>% filter( !is.na(vri) & !is.na(log_population_density)&
!is.na(log_gdp_per_capita) &!is.na(log_aged_65_older)&
!is.na(log_extreme_poverty)&
!is.na(human_development_index)&
!is.na(log_cardiovasc_death_rate)&
!is.na(log_hospital_beds_per_thousand)) %>%
select(c(vri,log_gdp_per_capita,log_aged_65_older,log_population_density,
CPI.score.2020,log_extreme_poverty,
satisfaction,life_expectancy,human_development_index,
log_cardiovasc_death_rate,log_diabetes_prevalence,
log_hospital_beds_per_thousand, GHS_score))
# Appendix
spearman <- cor(scaled_data_cleaned, use="pairwise.complete.obs", method="spearman")
heatmap <- qtlcharts::iplotCorr(
scaled_data_cleaned,
reorder=TRUE,
corr = spearman,
chartOpts = list(cortitle = "Spearman's correlation")
)
# Remove population_density based on scatter plot and correlation heatmap
scaled_data_cleaned <- scaled_data_cleaned %>% select(-log_population_density)
# hyper parameter grid search
## The importance results were produced on macOS
mtry <- seq(2, 6, by = 1)
num_trees <- c(100,150,200,250,300,350,400,450,500)
# Manual Search
#control <- trainControl(method="cv", number=3, search="grid")
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
set.seed(3888)
for (ntree in num_trees) {
fit <- train( vri~.,
data= scaled_data_cleaned,
method='rf',
tuneGrid=grid_param,
ntree= ntree,
importance=TRUE,
trControl=trainControl(method='cv',
number=5) )
key <- toString(ntree)
modellist[[key]] <- fit$finalModel
}
## COMPARE RESULTS
lowest_mse <- 1
model_mse <- modellist[[1]]
highest_r2 <- 0
model_r2 <- modellist[[1]]
for (i in c(1:length(modellist))) {
result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri))
result_avg <- mean(scaled_data_cleaned$vri)
mse = mean((scaled_data_cleaned$vri - result)^2)
r2 = 1 - (sum((scaled_data_cleaned$vri - result)^2))/(sum((scaled_data_cleaned$vri - result_avg)^2))
if (highest_r2 < r2){
highest_r2 = r2
model_r2 = modellist[[i]]
}
if (lowest_mse > mse) {
lowest_mse = mse
model_mse = modellist[[i]]
}
}
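The selection loop above keeps whichever model minimises in-sample MSE and maximises R-squared; on toy numbers (illustrative only), the two criteria work out as follows:

```r
obs  <- c(1, 2, 3, 4)
pred <- c(1.1, 1.9, 3.2, 3.8)
mse <- mean((obs - pred)^2)                                # 0.025
r2  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # 0.98
```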
# Calculating importance of features to the model: Handling multicollinear features by using conditional permutation
## https://cran.r-project.org/web/packages/permimp/vignettes/permimp-package.html
set.seed(3888)
optimal_rf <- randomForest(vri ~ ., data= scaled_data_cleaned, mtry = 2, keep.forest = TRUE, keep.inbag = FALSE, ntree= 300)
## compute permutation importance
rf_permimp <- permimp(optimal_rf, progressBar = FALSE, conditional = TRUE,scaled = TRUE, thresholdDiagnostics = TRUE,type="response",do_check = FALSE)
vimp <- as.data.frame(rf_permimp$values)
colnames(vimp)[1] <- "importance"
vimp <- round(vimp, 4)
vimp$var <- row.names(vimp)
options(scipen=999)
vimp <- vimp %>%
arrange(desc(importance)) %>%
slice(1:5)
vimp <- vimp %>%
ggplot(aes(reorder(var,importance), importance)) +
geom_col(fill="steelblue") +
coord_flip() +
theme_bw()+
ggtitle("Top 5 important variables") +
xlab("Factors in order") +
ylab("Scaled importance")
## permimp() output is sensitive to the operating system, so the importance results were saved to a file
# png(filename="importance.png")
# vimp
# dev.off()
# datatable(total_lag, options = list(columnDefs = list(list(className = 'dt-center', targets = "_all"))), caption = "List of different time lags measures for each country.")
total_lag %>%
kbl(caption = "List of the different time-lag measures for each country.",
format = "html",
booktabs = TRUE,
row.names = FALSE,
escape = TRUE,
align = "c") %>%
kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
position = "center") %>%
scroll_box(height = "400px")
# Box plots of time lags for all 3 types of time lags measurements
total_lag_df <- as.data.frame(total_lag)
colnames(total_lag_df) = c("LagType", "LagManhattanDistance", "LagType", "LagEuclideanDistance", "LagType", "LagMinkowskiDistanceP3") # ordered to match p = 1 (Manhattan), p = 2 (Euclidean), p = 3
box <- plot_ly(type = "box")
box <- box %>% add_boxplot(x = total_lag_df$LagManhattanDistance, boxpoints = "all", jitter = 0.3, name = "Manhattan Distance")
box <- box %>% add_boxplot(x = total_lag_df$LagEuclideanDistance, boxpoints = "all", jitter = 0.3, name = "Euclidean Distance")
box <- box %>% add_boxplot(x = total_lag_df$LagMinkowskiDistanceP3, boxpoints = "all", jitter = 0.3, name = "Minkowski Distance (P=3)")
box <- box %>% layout(title = "Box Plot of Lag Between 1st and 2nd Dose", autosize = F, width = '100%', height = '100%')
box
#VRI
start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
end <- start %m+% months(4)
r_list4 = ct_model(covid_data[start <= covid_data$date & covid_data$date < end, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_list4$vri_data))
temp_data <- data.frame(vri4 = temp_data$vri, temp_data %>% select(-vri))
vri_eva_data <- temp_data
start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
start <- start %m+% months(4)
end <- start %m+% months(4) # window closes 8 months after rollout start (start is already at 4 months)
r_list8 = ct_model(covid_data[start <= covid_data$date & covid_data$date < end, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_list8$vri_data))
temp_data <- data.frame(vri8 = temp_data$vri, location = temp_data$location)
vri_eva_data <- merge(vri_eva_data, temp_data, by = "location")
start <- min(covid_data$date[!is.na(covid_data$people_vaccinated)])
start <- start %m+% months(8)
# final window: from 8 months after rollout start up to the latest date
r_listf = ct_model(covid_data[start <= covid_data$date, ], log.y = FALSE, model = "logis")
temp_data <- data.frame(bind_rows(r_listf$vri_data))
temp_data <- data.frame(vrif = temp_data$vri, location = temp_data$location)
vri_eva_data <- merge(vri_eva_data, temp_data, by = "location")
# Merge the datasets
joint_data <- merge(vri_eva_data,corruption, by.x=c("iso_code"), by.y=c("ISO3"))
joint_data <- merge(joint_data,happiness, by.x=c("iso_code"), by.y=c("Code"))
ghs <- ghs %>% mutate(
iso_code = case_when(
!is.na(iso_code) ~ iso_code,
Country == "Bosnia and Hercegovina" ~ "BIH",
Country == "Cabo Verde" ~ "CPV",
Country == "Congo (Brazzaville)" ~ "COG",
Country == "Congo (Democratic Republic)" ~ "COD",
Country == "Côte d'Ivoire" ~ "CIV",
Country == "eSwatini" ~ "SWZ",
Country == "Kyrgyz Republic" ~ "KGZ",
Country == "São Tomé and Príncipe" ~ "STP",
Country == "St Kitts & Nevis" ~ "KNA",
Country == "St Lucia" ~ "LCA",
Country == "St Vincent & The Grenadines" ~ "VCT",
Country == "United Kingdom" ~ "GBR",
Country == "United States of America" ~ "USA"
)
)
joint_data <- merge(joint_data, ghs[,-2], by.x=c("iso_code"), by.y=c("iso_code"))
colnames(joint_data)[18] <- "GHS_score"
# take log for highly skewed variables
vri_extra_logged = joint_data %>% select(-population) %>%
mutate(
across(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
~log(.))
) %>%
rename_with(.cols = c(population_density, aged_65_older, gdp_per_capita, hospital_beds_per_thousand, cardiovasc_death_rate, extreme_poverty, diabetes_prevalence),
.fn = ~paste0("log_", .))
min_max_norm <- function(x) {
(x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))
}
scaled_data <- data.frame(lapply(vri_extra_logged %>% select(-iso_code, -location, -vri4, -vri8, -vrif), min_max_norm), vri4 = vri_extra_logged$vri4, vri8 = vri_extra_logged$vri8, vrif = vri_extra_logged$vrif)
# Removing NAs:
scaled_data_cleaned <- scaled_data %>% filter(!is.na(vri4) &
!is.na(vri8) &
!is.na(vrif) &
!is.na(log_population_density) &
!is.na(log_gdp_per_capita) &
!is.na(log_aged_65_older)&
!is.na(log_extreme_poverty)&
!is.na(human_development_index)&
!is.na(log_cardiovasc_death_rate)&
!is.na(log_hospital_beds_per_thousand)) %>%
select(c(vri4, vri8, vrif, log_gdp_per_capita,log_aged_65_older,log_population_density,CPI.score.2020,
log_extreme_poverty,satisfaction,life_expectancy,human_development_index,
log_cardiovasc_death_rate,log_diabetes_prevalence,log_hospital_beds_per_thousand, GHS_score))
# hyper-parameter grid search, repeated for each VRI time period
mtry <- seq(2, 12, by = 1)
num_trees <- c(100,150,200,250,300,350,400,450,500)
timeperiod <- c()
# VRI between 0 to 4 months
# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
fit <- train( vri4~.,
data= scaled_data_cleaned %>% select(-vri8, -vrif),
method='rf',
tuneGrid=grid_param,
ntree= ntree,
trControl=trainControl(method='cv',
number=3) )
key <- toString(ntree)
modellist[[key]] <- fit$finalModel
}
## COMPARE RESULTS
lowest_mse_4 <- 1
model_mse_4 <- modellist[[1]]
highest_r2_4 <- 0
model_r2_4 <- modellist[[1]]
for (i in c(1:length(modellist))) {
result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
result_avg <- mean(scaled_data_cleaned$vri4)
mse = mean((scaled_data_cleaned$vri4 - result)^2)
r2 = 1 - (sum((scaled_data_cleaned$vri4 - result)^2))/(sum((scaled_data_cleaned$vri4 - result_avg)^2))
if (highest_r2_4 < r2){
highest_r2_4 = r2
model_r2_4 = modellist[[i]]
}
if (lowest_mse_4 > mse) {
lowest_mse_4 = mse
model_mse_4 = modellist[[i]]
}
}
# VRI between 4 to 8 months
# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
fit <- train( vri8~.,
data= scaled_data_cleaned %>% select(-vri4, -vrif),
method='rf',
tuneGrid=grid_param,
ntree= ntree,
trControl=trainControl(method='cv',
number=3) )
key <- toString(ntree)
modellist[[key]] <- fit$finalModel
}
## COMPARE RESULTS
lowest_mse_8 <- 1
model_mse_8 <- modellist[[1]]
highest_r2_8 <- 0
model_r2_8 <- modellist[[1]]
for (i in c(1:length(modellist))) {
result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
result_avg <- mean(scaled_data_cleaned$vri8)
mse = mean((scaled_data_cleaned$vri8 - result)^2)
r2 = 1 - (sum((scaled_data_cleaned$vri8 - result)^2))/(sum((scaled_data_cleaned$vri8 - result_avg)^2))
if (highest_r2_8 < r2){
highest_r2_8 = r2
model_r2_8 = modellist[[i]]
}
if (lowest_mse_8 > mse) {
lowest_mse_8 = mse
model_mse_8 = modellist[[i]]
}
}
# VRI between 8 months to latest date
# Manual Search
grid_param <- expand.grid(.mtry=mtry)
modellist <- list()
for (ntree in num_trees) {
fit <- train( vrif~.,
data= scaled_data_cleaned %>% select(-vri8, -vri4),
method='rf',
tuneGrid=grid_param,
ntree= ntree,
trControl=trainControl(method='cv',
number=3) )
key <- toString(ntree)
modellist[[key]] <- fit$finalModel
}
## COMPARE RESULTS
lowest_mse_f <- 1
model_mse_f <- modellist[[1]]
highest_r2_f <- 0
model_r2_f <- modellist[[1]]
for (i in c(1:length(modellist))) {
result <- predict(modellist[[i]], newdata=scaled_data_cleaned %>% select(-vri4, -vri8, -vrif))
result_avg <- mean(scaled_data_cleaned$vrif)
mse = mean((scaled_data_cleaned$vrif - result)^2)
r2 = 1 - (sum((scaled_data_cleaned$vrif - result)^2))/(sum((scaled_data_cleaned$vrif - result_avg)^2))
if (highest_r2_f < r2){
highest_r2_f = r2
model_r2_f = modellist[[i]]
}
if (lowest_mse_f > mse) {
lowest_mse_f = mse
model_mse_f = modellist[[i]]
}
}
vri_evaluation <- data.frame(`Time period` = c("0 to 4 months", "4 to 8 months", "8 months to current"), MSE = c(lowest_mse_4, lowest_mse_8, lowest_mse_f), `R_squared` = c(highest_r2_4, highest_r2_8, highest_r2_f))
vri_eval_table <- vri_evaluation
var_list <- data.frame(colnames(joint_data)[5:16])
colnames(var_list) <- "Selected Variable Name"
var_list %>%
kbl(caption = "Selected variables for the random forest.",
format = "html",
booktabs = TRUE,
row.names = FALSE,
escape = TRUE) %>%
kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
position = "center")
heatmap
df_sum <- vri_data %>% select(location, vri)
df_sum %>%
kbl(caption = "VRI results for all countries",
format = "html",
booktabs = TRUE,
row.names = TRUE,
escape = TRUE,
digits = 4) %>%
kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
position = "center") %>%
scroll_box(height = "500px")
vri_bplot <- ggplot(vri_data,aes(x="",y=vri)) + geom_boxplot(outlier.colour="blue") + ylab("VRI") + xlab("") + ggtitle("Distribution of VRIs")
ggplotly(vri_bplot) %>%
layout(autosize = F, width = '100%', height = '100%')
vri_eval_table %>%
kbl(caption = "VRI time evaluation result",
format = "html",
booktabs = TRUE,
row.names = FALSE,
escape = TRUE) %>%
kable_styling(bootstrap_options = c("condensed", "striped", "hover"),
position = "center")
sessionInfo()
If the document isn't knitting, it may be due to a missing Git LFS dependency. To resolve this:
Step 1: Install Git LFS. On macOS run "brew install git-lfs"; on Windows, Git LFS is already included in Git for Windows.
Step 2: In the report directory, run "git lfs install" from the command line.
Step 3: Run "git lfs pull" in the terminal, then knit.